Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

conditionally and persisitently rare taxa #98

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from
Open

Conversation

Daenarys8
Copy link
Contributor

ping #83

Signed-off-by: Daena Rys <[email protected]>
R/addMetrics.R Outdated Show resolved Hide resolved
R/addMetrics.R Outdated Show resolved Hide resolved
R/addMetrics.R Outdated Show resolved Hide resolved
R/addMetrics.R Outdated
Comment on lines 54 to 77
# Calculate metrics for each split subject/group
results_list <- lapply(split_data, function(sub_data) {
# Initialize a list to hold results for each metric
metric_results <- list()

# Loop through each metric and calculate it
for (metric in metrics) {
# Check if the metric is supported and calculate accordingly
if (metric == "CRT") {
crt_results <- .calculateCRT(sub_data, thresholds[["CRT"]])
metric_results[["CRT"]] <- crt_results
} else {
stop(paste("Metric", metric, "is not supported yet."),
call. = FALSE)
}
}

# Add the calculated metrics to the metadata (colData or rowData)
for (metric in names(metric_results)) {
colData(sub_data)[[metric]] <- metric_results[[metric]]
}

return(sub_data)
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more efficient to:

  1. Get matrix
  2. Get the list which has vectors as elements. Each vector represent members of each group
  3. Loop through this list
  4. Select the group members from the matrix
  5. Calculate values
  6. Combine all the results and sort values based on sample names
  7. Add to colData (in add* method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced the right location should be colData as the interest is taxa related.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I followed the code and did not think about it. As it these features are global (for whole dataset), colData is not correct place indeed

I think the output could be similar to getPrevalent and getRare, i.e., vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. yes. If we have a single subject/group with time series, then this is a vector over features (or rowData) - one measure of persistence or rarity for each feature and rowData would be a suitable location.

If we have multiple subjects/groups with time series, then this would be features x groups matrix.

At simplest it could be implemented for just a single-group case, and rest left for the user.

If we support multi-group (multi-subject) analyses, then that is a more general design question and linked to #30

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one vector could be calculated for whole dataset for now. As you mentioned, the multi-group thing is more general question, and is affecting also the mia functions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - note that this can only be calculated in the case where the whole data set has some continuous variable that extends all samples (like is the case when there is time dimension for a single subject). If we have multiple groups then this is not defined for the whole data set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed something, but I cannot see direct linkage of time series to this concept of conditionally/persistent rare/abundant taxa. It seems that this is more general concept?

Based on the following article, I summarized the concept

Sizhong Yang, Matthias Winkel, Dirk Wagner, Susanne Liebner, Community structure of rare methanogenic archaea: insight from a single functional group, FEMS Microbiology Ecology, Volume 93, Issue 11, November 2017, fix126, https://doi.org/10.1093/femsec/fix126

Rare taxa:

  • Average relative abundance below 1% accross all samples

Abundant taxa:

  • Average relative abundance above 1% accross all samples

Conditionally rare taxa:

microbial taxa that are typically in low abundance in one locality but occasionally become prevalent in at least one other sample

  • CRT was assigned if an OTU exhibits a maximum relative abundance at least 100 times higher than its minimum value (max:min > 100)

Permanently rare taxa:

  • If the (max/min) ratio is less than 5, that is the abundance rarely fluctuates among samples, it will be grouped into the permanently rare taxa (PER)

Rare and abundant taxa can be already calculated by mia functions:

getRare(tse, assay.type = "relabundance", detection = 1/100)
getPrevalent(tse, assay.type = "relabundance", detection = 1/100)

I see that these conditionally rare and permanently rare taxa are very close to getRare() and getPrevalent(). As they do not necessarily deal with time series, should these methods be in mia instead: getConditionallyRare() & getPermanentlyRare()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I mixed it up, no time series needed.

CRT seems like a bit strange concept in the end. I am not sure if there should be much difference whether the minimum count is 1, 2, or 3 reads but this can make a huge difference in the max:min calculation. But yes, it might be relevant to recognize taxa that have vast variation. Variance would not capture all these cases, if the blooms cover only few samples. Shouldn't this be called conditionally abundant, rather than conditionally rare..?

PER: is the rarity also evaluated, or just max/min ratio? some taxa might be abundant and prevalent, while having a low max/min ratio. At least in principle. Then it would be misleading to call them PER. The interesting PER are taxa that might be prevalent but always stay under a certain threshold and never bloom.

We could add the methods as suggested and it could go to mia. If this could be made into a more generic index function (like addAlpha for samples), that could help to simplify maintenance and adding possible other measures later. Would that be feasible..?

On the other if this seem time-consuming to finalize, it could be left for later.

Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
@TuomasBorman
Copy link
Collaborator

I read this article that Leo shared in the issue: https://journals.asm.org/doi/10.1128/mbio.01371-14

In this article, conditionally rare taxa is determined based on coefficient of bimodality. They determined CRT for each system from where multiple time points were collected.

How should we proceed with these? CRT is determined with two different ways.

Proposal:

In mia, introduce new funtions:

getConditionallyRare() and getPermanentlyRare().

These functions would return a vector of taxa (only those ones that are classified as CRT or PRT). Similar to getRare() and getPrevalent().

These would follow the implementation introduced in this article

Sizhong Yang, Matthias Winkel, Dirk Wagner, Susanne Liebner, Community structure of rare methanogenic archaea: insight from a single functional group, FEMS Microbiology Ecology, Volume 93, Issue 11, November 2017, fix126, https://doi.org/10.1093/femsec/fix126

Conditionally rare:

(max:min > 100), where max and min are maximum and minimum abundances

Permantently rare:

(max:min < 5)

In mia, implement getPrevelanceClass() function that returns a class for each taxa (prevalent, CRT, PRT or rare).

In miaTime, implement getBimodalityCoefficient(). For each system/group/patient and taxa, return bimodality coefficient.

The method would return either a vector length nrow or data.frame with dimensions nrow*N_system

This method would follow this implementation:

Shade A, Jones SE, Caporaso JGHandelsman J, Knight RFierer N, Gilbert JA2014.Conditionally Rare Taxa Disproportionately Contribute to Temporal Changes in Microbial Diversity. mBio5:10.1128/mbio.01371-14.https://doi.org/10.1128/mbio.01371-14

@antagomir

@antagomir
Copy link
Member

There are different multi/bimodality indices available, here one implementation:
https://cran.r-project.org/web/packages/BimodalIndex/BimodalIndex.pdf

@antagomir
Copy link
Member

Interesting, I had not paid attention to the multiple definitions.

Definitions in (1-2) are somehow arbitrary, and relatively easy calculate manually in whatever desirable way. I am hesitating if we need full methods for these in the end, before there is a clear need / more use cases. On the other hand this PR is already half-way..

Bimodalities are interesting, too. I'm good with that if it is quick to add. It could make sense to check what uni/bi/multimodality indices are available in R. There are a couple of different methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants